Automated backport of #2743: Use leader election in globalnet controller #2759
Merged
tpantelis merged 2 commits into submariner-io:release-0.16 from tpantelis:automated-backport-of-#2743-upstream-release-0.16 on Oct 17, 2023
Conversation
Due to the timing of notifications during gateway node transition, it's been observed that there can be a small window of a few seconds where the controllers on the previously active gateway are still running when the controllers on the newly active gateway are started. This results in two instances of the controllers running simultaneously in the cluster, which can cause strange behavior and inconsistent results. To alleviate this scenario, use leader election in the GatewayMonitor to ensure that the instance on the newly active gateway doesn't start its controllers until the previous instance has stopped its controllers and released the leader lock.

There are a couple of things to be aware of:

- The K8s leader election functionality periodically renews the lease. If the lease can't be renewed before the RenewDeadline expires, it stops. For our usage, we don't have instances concurrently vying for leadership, so we don't really need to keep renewing the lease. Ideally we would set the RenewDeadline very high to essentially disable renewal, but it needs to be less than the LeaseDuration setting, which we don't want too high. So we handle renewal failure by simply re-running leader election.

- When transitioning to a non-gateway or shutting down, the GatewayMonitor cancels leader election, which causes it to release the lock; this clears the HolderIdentity field and sets LeaseDuration to one second, enabling the next instance to quickly acquire the lock. However, if the controller instance crashes and doesn't properly release the lock, the next instance has to wait out the LeaseDuration period before acquiring the lock. So we don't want LeaseDuration set too high, but we also don't want it too low, so that the current instance has enough time to finish stopping its controllers.

I picked 20 sec for LeaseDuration and 15 sec for RenewDeadline. If the current instance takes longer than 20 sec to stop its controllers, it's still possible there could be two instances running concurrently, but this scenario should be unlikely.

Signed-off-by: Tom Pantelis <[email protected]>
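Below is a minimal sketch of the kind of setup described above, using client-go's leaderelection package with the chosen 20 sec LeaseDuration and 15 sec RenewDeadline, plus a loop that simply re-runs leader election after a renewal failure. The package name, lease name/namespace, and the startControllers helper are placeholders for illustration, not the actual GatewayMonitor code.

```go
package gatewaymonitor // illustrative package name, not the actual submariner code

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLeaderElection keeps re-running leader election so a renewal failure
// (RenewDeadline exceeded) doesn't permanently sideline this instance; it
// simply competes for the lock again on the next loop iteration.
func runLeaderElection(ctx context.Context, client kubernetes.Interface, identity string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "gateway-monitor-lock", // placeholder lease name
			Namespace: "submariner-operator",  // placeholder namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	for ctx.Err() == nil {
		leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
			Lock: lock,
			// Releasing the lock on cancellation clears HolderIdentity and shrinks
			// the lease duration so the next gateway instance can acquire it quickly.
			ReleaseOnCancel: true,
			LeaseDuration:   20 * time.Second, // max wait for the next instance if we crash without releasing
			RenewDeadline:   15 * time.Second, // must be less than LeaseDuration
			RetryPeriod:     2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: func(ctx context.Context) {
					startControllers(ctx) // placeholder: start the globalnet controllers
				},
				OnStoppedLeading: func() {
					// Renewal failed or the context was cancelled; RunOrDie returns
					// and the surrounding loop re-runs leader election if still active.
				},
			},
		})
	}
}

// startControllers is a placeholder for starting the globalnet controllers.
func startControllers(ctx context.Context) {}
```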
If leadership is lost due to failure after the RenewDeadline expires, we currently try to re-acquire leadership but don't stop the controllers. A failure can really only happen if the node loses contact with the API server and can't update the Lease resource. However, it's possible there could be a gateway transition during the time that leadership is lost, and the new instance may be able to contact the API server and acquire the lock (after the LeaseDuration expires). In that case, we'd have two controller instances running, although the first is most likely effectively disabled anyway since it presumably can't contact the API server. However, if it regains contact before it observes that the gateway has transitioned, it could try to process globalnet resources and cause strange inconsistencies.

This is an edge case, but it is avoided by stopping the controllers if leadership is lost due to failure, especially since the controllers likely can't reach the API server either. However, we don't clear the globalnet chains, to avoid dataplane disruption.

Signed-off-by: Tom Pantelis <[email protected]>
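The following is a rough sketch of the loss-of-leadership handling described above; onLeadershipLost and stopControllers are hypothetical names used only to illustrate the behavior (stop the controllers but leave the globalnet chains intact), not the actual submariner functions.

```go
package gatewaymonitor // illustrative only

// onLeadershipLost would be invoked from the leader election OnStoppedLeading
// callback when the lease couldn't be renewed (typically because contact with
// the API server was lost).
func onLeadershipLost() {
	// Stop the controllers so that, if API server contact returns after a
	// gateway transition elsewhere, this instance can't race with the newly
	// elected one. The globalnet chains are intentionally left in place to
	// avoid disrupting the dataplane while leadership is re-acquired.
	stopControllers(false /* clearGlobalnetChains */)
}

// stopControllers is a hypothetical helper; the flag distinguishes the failure
// path (keep the chains) from a normal transition off the gateway node.
func stopControllers(clearGlobalnetChains bool) {
	// ...
}
```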
tpantelis requested review from Oats87, skitt and sridhargaddam as code owners on October 17, 2023 13:13
🤖 Created branch: z_pr2759/tpantelis/automated-backport-of-#2743-upstream-release-0.16
sridhargaddam approved these changes on Oct 17, 2023
skitt approved these changes on Oct 17, 2023
🤖 Closed branches: [z_pr2759/tpantelis/automated-backport-of-#2743-upstream-release-0.16]
tpantelis deleted the automated-backport-of-#2743-upstream-release-0.16 branch on October 24, 2023 12:23
dfarrell07 added the release-note-needed (Should be mentioned in the release notes) label on Oct 24, 2023
Labels
- automated-backport
- ready-to-test (When a PR is ready for full E2E testing)
- release-note-handled
- release-note-needed (Should be mentioned in the release notes)
Backport of #2743 on release-0.16.
#2743: Use leader election in globalnet controller
For details on the backport process, see the backport requests page.